Abstract

This paper describes an experimental com- 
parison of three unsupervised learning al- 
gorithms that distinguish the sense of 
an ambiguous word in untagged text. 
The methods described in this paper, 
McQuitty's similarity analysis, Ward's 
minimum-variance method, and the EM 
algorithm, assign each instance of an am- 
biguous word to a known sense definition 
based solely on the values of automatically 
identifiable features in text. These meth- 
ods and feature sets are found to be more 
successful in disambiguating nouns than
adjectives or verbs. Overall, the most
accurate of these procedures is McQuitty's 
similarity analysis in combination with a 
high dimensional feature set. 

1 Introduction 

Statistical methods for natural language process- 
ing are often dependent on the availability of costly 
knowledge sources such as manually annotated text 
or semantic networks. This limits the applicability 
of such approaches to domains where this
hard-to-acquire knowledge is already available. This paper
presents three unsupervised learning algorithms that 
are able to distinguish among the known senses (i.e., 
as defined in some dictionary) of a word, based only 
on features that can be automatically extracted from 
untagged text. 

The object of unsupervised learning is to determine the class
membership of each observation (i.e., each object to be
classified) in a sample without using training examples of
correct classifications. We discuss three algorithms,
McQuitty's similarity analysis (McQuitty, 1966), Ward's
minimum-variance method (Ward, 1963), and the EM algorithm
(Dempster, Laird, and Rubin, 1977), that can be used to
distinguish among the known senses of an ambiguous word
without the aid of disambiguated examples. The EM algorithm
produces maximum likelihood estimates of the parameters of a
probabilistic model, where that model has been specified in
advance. Both Ward's and McQuitty's methods are agglomerative
clustering algorithms that form classes of unlabeled
observations that minimize their respective distance measures
between class members.

The rest of this paper is organized as follows. First, we
present introductions to Ward's and McQuitty's methods
(Section 2) and the EM algorithm (Section 3). We discuss the
thirteen words (Section 4) and the three feature sets
(Section 5) used in our experiments. We present our
experimental results (Section 6) and close with a discussion
of related work (Section 7).

2 Agglomerative Clustering

In general, clustering methods rely on the assumption that
classes occupy distinct regions in the feature space. The
distance between two points in a multi-dimensional space can
be measured using any of a wide variety of metrics (see, e.g.,
(Devijver and Kittler, 1982)). Observations are grouped in
the manner that minimizes the distance between the
members of each class.

Ward's and McQuitty's methods are agglomerative
clustering algorithms that differ primarily in how 
they compute the distance between clusters. All 
such algorithms begin by placing each observation 
in a unique cluster, i.e. a cluster of one. The two 
closest clusters are merged to form a new cluster 
that replaces the two merged clusters. Merging of 
the two closest clusters continues until only some 
specified number of clusters remain. 

However, our data does not immediately lend itself to a
distance-based interpretation. Our features represent
part-of-speech (POS) tags, morphological characteristics, and
word co-occurrence; such features are nominal and their values
do not have scale. Given a POS feature, for example, we could
choose noun = 1, verb = 2, adjective = 3, and adverb = 4. That
adverb is represented by a larger number than noun is purely
coincidental and implies nothing about the relationship
between nouns and adverbs.

Figure 1: Matrix of Feature Values

Figure 2: Dissimilarity Matrix

Thus, before we employ either clustering algorithm, we
represent our data sample in terms of a dissimilarity matrix.
Suppose that we have N observations in a sample where each
observation has q features. This data is represented in an
N x N dissimilarity matrix such that the value in cell (i, j),
where i represents the row number and j represents the column,
is equal to the number of features in observations i and j
that do not match.

For example, in Figure 1 we have four observations. We record
the values of three nominal features for each observation.
This sample can be represented by the 4 x 4 dissimilarity
matrix shown in Figure 2. In the dissimilarity matrix, cells
(1, 2) and (2, 1) have the value 2, indicating that the first
and second observations in Figure 1 have different values for
two of the three features. A value of 0 indicates that
observations i and j are identical.

When clustering our data, each observation is represented by
its corresponding row (or column) in the dissimilarity matrix.
Using this representation, observations that fall close
together in feature space are likely to belong to the same
class and are grouped together into clusters. In this paper,
we use Ward's and McQuitty's methods to form clusters of
observations, where each observation is represented by a row
in a dissimilarity matrix.

2.1 Ward's minimum variance method

In Ward's method, the internal variance of a cluster is the
sum of squared distances between each observation in the
cluster and the mean observation for that cluster (i.e., the
average of all the observations in the cluster). At each step
in Ward's method, a new cluster, C_{kl}, with the smallest
possible internal variance, is created by merging the two
clusters, C_k and C_l, that have the minimum variance between
them. The variance between C_k and C_l is computed as follows:

    V_{kl} = \frac{\lVert \bar{x}_k - \bar{x}_l \rVert^2}{\frac{1}{N_k} + \frac{1}{N_l}}    (1)

where \bar{x}_k is the mean observation for cluster C_k, N_k
is the number of observations in C_k, and \bar{x}_l and N_l
are defined similarly for C_l.

Implicit in Ward's method is the assumption that 
the sample comes from a mixture of normal distributions.
While NLP data is typically not well characterized by a
normal distribution (see, e.g., (Zipf, 1935), (Pedersen,
Kayaalp, and Bruce, 1996)), there is evidence that our data,
when represented by a dissimilarity matrix, can be adequately
characterized by a normal distribution. However, we will continue
to investigate the appropriateness of this assump- 
tion. 
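As a rough illustration of the representation and merge criterion described above, the following Python sketch builds a mismatch-count dissimilarity matrix from nominal features and evaluates a Ward-style merge cost on its rows. It assumes the usual form of Ward's criterion (squared distance between cluster means divided by 1/N_k + 1/N_l); the sample data and the function names `dissimilarity_matrix`, `mean_vector`, and `ward_cost` are hypothetical and are not taken from the experiments reported here.

```python
def dissimilarity_matrix(observations):
    """Return the N x N matrix of feature-mismatch counts."""
    return [[sum(a != b for a, b in zip(row_i, row_j))
             for row_j in observations]
            for row_i in observations]


def mean_vector(rows):
    """Component-wise mean of a list of equal-length numeric rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def ward_cost(cluster_k, cluster_l, rows):
    """Squared distance between cluster means over (1/N_k + 1/N_l)."""
    mean_k = mean_vector([rows[i] for i in cluster_k])
    mean_l = mean_vector([rows[i] for i in cluster_l])
    squared_distance = sum((a - b) ** 2 for a, b in zip(mean_k, mean_l))
    return squared_distance / (1 / len(cluster_k) + 1 / len(cluster_l))


# Hypothetical sample: four observations, three nominal features each.
observations = [
    ("noun", "singular", "yes"),
    ("noun", "plural", "no"),
    ("verb", "plural", "no"),
    ("noun", "singular", "no"),
]
dissimilarity = dissimilarity_matrix(observations)

# Each observation is represented by its row of the dissimilarity matrix.
cost_first_fourth = ward_cost([0], [3], dissimilarity)
cost_first_third = ward_cost([0], [2], dissimilarity)
```

In this toy sample the first and fourth observations are cheaper to merge than the first and third, so a Ward-style procedure would prefer the former pair at this step.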

2.2 McQuitty's similarity analysis 

In McQuitty's method, clusters are based on a sim- 
ple averaging of the feature mismatch counts found 
in the dissimilarity matrix. 

At each step in McQuitty's method, a new cluster, C_{kl}, is
formed by merging the clusters C_k and C_l that have the
fewest dissimilar features between them. The clusters to be
merged, C_k and C_l, are identified by finding the cell (l, k)
(or (k, l)), where k ≠ l, that has the minimum value in the
dissimilarity matrix.

Once the new cluster C_{kl} is created, the dissimilarity
matrix is updated to reflect the number of dissimilar features
between C_{kl} and all other existing clusters. The
dissimilarity between any existing cluster C_i and C_{kl} is
computed as:

    D_{kl-i} = \frac{D_{ki} + D_{li}}{2}    (2)

where D_{ki} is the number of dissimilar features between
clusters C_k and C_i and D_{li} is similarly defined for
clusters C_l and C_i. This is simply the average number of
mismatches between each component of the new cluster and the
existing cluster.

Unlike Ward's method, McQuitty's method makes 
no assumptions concerning the distribution of the 
data sample. 
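The merge-and-update loop described in this section can be sketched in a few lines of Python. This is a minimal illustration only: the dissimilarity matrix is hypothetical, the function name `mcquitty_cluster` is our own, and ties are broken by iteration order here, whereas the experiments reported below select randomly among equally distant groups.

```python
from itertools import combinations


def mcquitty_cluster(dissimilarity, num_clusters):
    """Merge the closest clusters until num_clusters remain."""
    clusters = {i: {i} for i in range(len(dissimilarity))}
    distance = {frozenset((i, j)): float(dissimilarity[i][j])
                for i, j in combinations(clusters, 2)}
    next_id = len(dissimilarity)
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the fewest mismatches.
        k, l = min(combinations(clusters, 2),
                   key=lambda pair: distance[frozenset(pair)])
        merged = clusters.pop(k) | clusters.pop(l)
        # New dissimilarity: the simple average of the two old ones.
        for i in clusters:
            distance[frozenset((next_id, i))] = (
                distance[frozenset((k, i))] + distance[frozenset((l, i))]
            ) / 2
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())


# Hypothetical 4 x 4 dissimilarity matrix of mismatch counts.
example_matrix = [
    [0, 2, 3, 1],
    [2, 0, 1, 1],
    [3, 1, 0, 2],
    [1, 1, 2, 0],
]
two_clusters = mcquitty_cluster(example_matrix, num_clusters=2)
```

Because only pairwise dissimilarities are averaged, the procedure never needs the original feature vectors once the matrix has been built.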



3 EM Algorithm 



The expectation maximization algorithm (Dempster,
Laird, and Rubin, 1977), commonly known as
the EM algorithm, is an iterative estimation proce- 
dure in which a problem with missing data is recast 
to make use of complete data estimation techniques. 
In our work, the sense of an ambiguous word is rep- 
resented by a feature whose value is missing. 

In order to use the EM algorithm, the paramet- 
ric form of the model representing the data must 
be known. In these experiments, we assume that the model form
is the Naive Bayes (Duda and Hart, 1973). In this model, all
features are conditionally independent given the value of the
classification feature, i.e., the sense of the ambiguous word.
This assumption is based on the success of the Naive Bayes
model when applied to supervised word-sense disambiguation
(e.g., (Gale, Church, and Yarowsky, 1992), (Leacock, Towell,
and Voorhees, 1993), (Mooney, 1996), (Pedersen, Bruce, and
Wiebe, 1997), (Pedersen and Bruce, 1997a)).



There are two potential problems when using the 
EM algorithm. First, it is computationally expen- 
sive and convergence can be slow for problems with 
large numbers of model parameters. Unfortunately 
there is little to be done in this case other than re- 
ducing the dimensionality of the problem so that 
fewer parameters are estimated. Second, if the likelihood
function is very irregular, the algorithm may always converge
to a local maximum and not find the global maximum. In this
case, an alternative is to use the more computationally
expensive method of Gibbs Sampling (Geman and Geman, 1984).



3.1 Description 

At the heart of the EM Algorithm lies the Q-function. This is
the expected value of the log-likelihood function for the
complete data D = (Y, S), where Y is the observed data and S
is the missing sense value:

    Q(\theta^i | \theta) = E[\ln p(Y, S | \theta^i) | \theta, Y]    (3)

Here, θ is the current value of the maximum likelihood
estimates of the model parameters and θ^i is the improved
estimate that we are seeking; p(Y, S | θ^i) is the likelihood
of observing the complete data given the improved estimate of
the model parameters.

When approximating the maximum of the likelihood function, the
EM algorithm starts from a randomly generated initial estimate
of θ and then replaces θ by the θ^i which maximizes
Q(θ^i | θ). This process is broken down into two steps:
expectation (the E-step), and maximization (the M-step). The
E-step finds the expected values of the sufficient statistics
of the complete model using the current estimates of the model
parameters. The M-step makes maximum likelihood estimates of
the model parameters using the sufficient statistics from the
E-step. These steps iterate until the parameter estimates θ
and θ^i converge.

The M-step is usually easy, assuming it is easy for the
complete data problem; the E-step is not necessarily so.
However, for decomposable models, such as the Naive Bayes, the
E-step simplifies to the calculation of the expected counts in
the marginal distributions of interdependent features, where
the expectation is with respect to θ. The M-step simplifies to
the calculation of new parameter estimates from these counts.
Further, these expected counts can be calculated by
multiplying the sample size N by the probability of the
complete data within each marginal distribution given θ and
the observed data within each marginal Y_m. This simplifies
to:

    count^i(S_m, Y_m) = P(S_m | Y_m) \times count(Y_m)

where count^i is the current estimate of the expected count
and P(S_m | Y_m) is formulated using θ.

3.2 Example

For the Naive Bayes model with 3 observable features A, B, C
and an unobservable classification feature S, where
θ = {P(a,s), P(b,s), P(c,s), P(s)}, the E and M-steps are:

1. E-step: The expected values of the sufficient statistics
   are computed as follows:

    count^i(s,a) = P(s|a) \times count(a)
    count^i(s,b) = P(s|b) \times count(b)
    count^i(s,c) = P(s|c) \times count(c)
    count^i(s) = \sum_{a,b,c} \{ P(s|a,b,c) \times count(a,b,c) \}

   where:

    P(s|a) = \sum_{b,c} P(s|a,b,c)
    P(s|a,b,c) = \frac{P(s,a,b,c)}{P(a,b,c)}
    P(s,a,b,c) = \frac{P(s,a) \times P(s,b) \times P(s,c)}{P(s)^2}
    P(a,b,c) = \sum_{s} \frac{P(s,a) \times P(s,b) \times P(s,c)}{P(s)^2}

2. M-step: The sufficient statistics from the E-step are used
   to re-estimate the model parameters θ^i:

    P^i(s,a) = \frac{count^i(s,a)}{N}
    P^i(s,b) = \frac{count^i(s,b)}{N}
    P^i(s,c) = \frac{count^i(s,c)}{N}
    P^i(s) = \frac{count^i(s)}{N}

where s, a, b, and c denote specific values of S, A, B, and C
respectively, and P(s|b) and P(s|c) are defined analogously to
P(s|a).
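To make the E-step and M-step concrete, here is a small Python sketch of EM for a Naive Bayes model whose class variable is never observed. It is an illustration under stated assumptions rather than the implementation used in these experiments: the function name `em_naive_bayes` and the toy sample are hypothetical, the expected counts are accumulated per distinct observation using P(s | a, b, c), and the procedure starts from random posteriors instead of a random θ.

```python
import random
from collections import Counter, defaultdict


def em_naive_bayes(observations, num_senses, iterations=50, seed=0):
    """Estimate P(s) and P(s, value) per feature; also return P(s | obs)."""
    rng = random.Random(seed)
    n = len(observations)
    data = Counter(observations)  # count(a, b, c) for each distinct tuple
    # Assumption: start from random posteriors rather than a random theta.
    posterior = {}
    for obs in data:
        weights = [rng.random() + 1e-3 for _ in range(num_senses)]
        posterior[obs] = [w / sum(weights) for w in weights]
    for _ in range(iterations):
        # M-step: parameters are expected counts divided by N.
        p_s = [0.0] * num_senses
        p_joint = [defaultdict(float) for _ in observations[0]]
        for obs, count in data.items():
            for s in range(num_senses):
                expected = posterior[obs][s] * count
                p_s[s] += expected / n
                for j, value in enumerate(obs):
                    p_joint[j][(s, value)] += expected / n
        # E-step: P(s | obs) is proportional to prod_j P(s, f_j) / P(s)^(q-1).
        for obs in data:
            scores = []
            for s in range(num_senses):
                score = p_s[s]
                for j, value in enumerate(obs):
                    ratio = p_joint[j][(s, value)] / p_s[s] if p_s[s] > 0 else 0.0
                    score *= ratio
                scores.append(score)
            total = sum(scores)
            if total > 0:
                posterior[obs] = [x / total for x in scores]
    return p_s, p_joint, posterior


# Hypothetical sample with two clearly separated groups of observations.
sample = [("x", "p", "u")] * 30 + [("y", "q", "v")] * 20
p_s, p_joint, posterior = em_naive_bayes(sample, num_senses=2)
```

The discovered classes carry no sense labels; which class index corresponds to which sense must be decided afterwards.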

4 Experimental Procedure 

Experiments were conducted to disambiguate 13 dif- 
ferent words using 3 different feature sets. In these 
experiments, each of the 3 unsupervised disambiguation
methods is applied to each of the 13 words using
each of the 3 feature sets; this defines a total of 117
different experiments. In addition, each experiment 
was repeated 25 times in order to study the variance 
introduced by randomly selecting initial parameter 
estimates, in the case of the EM algorithm, and ran- 
domly selecting among equally distant groups when 
clustering using Ward's and McQuitty's methods. 

In order to evaluate the unsupervised learning al- 
gorithms we use sense-tagged text in these exper- 
iments. However, this text is only used to evalu- 
ate the accuracy of our methods. The classes dis- 
covered by the unsupervised learning algorithms are 
mapped to dictionary senses in a manner that max- 
imizes their agreement with the sense-tagged text. 
If the sense-tagged text were not available, as would 
often be the case in an unsupervised experiment, this 
mapping would have to be performed manually. 
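One simple way to realize the mapping described above is to try every one-to-one assignment of discovered classes to dictionary senses and keep the assignment that agrees most often with the sense-tagged text. The Python sketch below assumes a one-to-one mapping and no more classes than senses; neither detail is specified in the text, and the function name `best_mapping` and the example labels are hypothetical.

```python
from itertools import permutations


def best_mapping(cluster_ids, gold_senses):
    """Return the class-to-sense mapping with maximal agreement, and its accuracy."""
    clusters = sorted(set(cluster_ids))
    senses = sorted(set(gold_senses))
    best_map, best_correct = None, -1
    # Assumes len(clusters) <= len(senses); each class gets a distinct sense.
    for assignment in permutations(senses, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(mapping[c] == g for c, g in zip(cluster_ids, gold_senses))
        if correct > best_correct:
            best_map, best_correct = mapping, correct
    return best_map, best_correct / len(gold_senses)


# Hypothetical output of a clustering run and the corresponding gold tags.
discovered = [0, 0, 1, 1, 1, 0]
tagged = ["firm", "firm", "worry", "worry", "firm", "firm"]
mapping, accuracy = best_mapping(discovered, tagged)
```

With two or three senses per word the exhaustive search over assignments is trivial; it would not scale to large sense inventories.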

The words disambiguated and their sense distributions are
shown in Figure 3. All data, with the exception of the data
for line, come from the ACL/DCI Wall Street Journal corpus
(Marcus, Santorini, and Marcinkiewicz, 1993). With the
exception of line, each ambiguous word is tagged with a single
sense defined in the Longman Dictionary of Contemporary
English (LDOCE) (Procter, 1978). The data for the 12 words
tagged using LDOCE senses are described in more detail in
(Bruce, Wiebe, and Pedersen, 1996).

The line data comes from both the ACL/DCI WSJ corpus and the
American Printing House for the Blind corpus. Each occurrence
of line is tagged with a single sense defined in WordNet
(Miller, 1995). This data is described in more detail in
(Leacock, Towell, and Voorhees, 1993).

Every experiment utilizes all of the sentences 
available for each word. The number of sentences 
available per word is shown as "total count" in Figure 3.
We have reduced the sense inventory of these
words so that only the two or three most frequent 
senses are included in the text being disambiguated. 
For several of the words, there are minority senses 
that form a very small percentage (i.e., < 5%) of 
the total sample. Such minority classes are not yet 
well handled by unsupervised techniques; therefore 
we do not consider them in this study. 

5 Feature Sets 

We define three different feature sets for use in these 
experiments. Our objective is to evaluate the effect 
that different types of features have on the accuracy 
of unsupervised learning algorithms such as those 
discussed here. We are particularly interested in the 
impact of the overall dimensionality of the feature 
space, and in determining how indicative different 
feature types are of word senses. Our feature sets are 
composed of various combinations of the following 
five types of features. 

Morphology The feature M represents the mor- 
phology of the ambiguous word. For nouns, M is 
binary indicating singular or plural. For verbs, the 
value of M indicates the tense of the verb and can 
have up to 7 possible values. This feature is not used 
for adjectives. 

Part-of-Speech Features of the form PL_i represent the
part-of-speech (POS) of the word i positions to the left of
the ambiguous word. PR_i represents the POS of the word i
positions to the right. In these experiments, we used 4 POS
features, PL_1, PL_2, PR_1, and PR_2, to record the POS of the
words 1 and 2 positions to the left and right of the ambiguous
word. Each POS feature can have one of 5 possible values:
noun, verb, adjective, adverb or other.

Co-occurrences Features of the form C_i are binary
co-occurrence features. They indicate the presence or absence
of a particular content word in the same sentence as the
ambiguous word. We use 3 binary co-occurrence features, C_1,
C_2, and C_3, to represent the presence or absence of each of
the three most frequent content words, C_1 being the most
frequent content word, C_2 the second most frequent and C_3
the third. Only sentences containing the ambiguous word were
used to establish word frequencies.



Adjective Senses

chief:
    highest in rank:                          86%
    most important; main:                     [...]
common:
    [...]:                                    84%
    belonging to or shared by 2 or more:       8%
    happening often; usual:                    8%
last: (total count: 3004)
    on the occasion nearest in the past:      94%
    after all others:                          6%
public: (total count: 715)
    concerning people in general:             68%
    concerning the government and people:     19%
    not secret or private:                    13%

Noun Senses

bill: (total count: 1341)
    a proposed law under consideration:       68%
    a piece of paper money or treasury bill:  22%
    a list of things bought and their price:  10%
concern: (total count: 1235)
    a business; firm:                         64%
    worry; anxiety:                           36%
drug: (total count: 1127)
    a medicine; used to make medicine:        57%
    a habit-forming substance:                43%
interest: (total count: 2113)
    money paid for the use of money:          59%
    a share in a company or business:         24%
    readiness to give attention:              17%
line: (total count: 1149)
    a wire connecting telephones:             37%
    a cord; cable:                            32%
    an orderly series:                        30%

Verb Senses

agree: (total count: 1109)
    to concede after disagreement:            74%
    to share the same opinion:                26%
close: (total count: 1354)
    to (cause to) end:                        77%
    to (cause to) stop operation:             23%
help: (total count: 1267)
    to enhance - inanimate object:            78%
    to assist - human object:                 22%
include: (total count: 1526)
    to contain in addition to other parts:    91%
    to be a part of - human subject:           9%

Figure 3: Distribution of Senses

Figure 4: Co-occurrence Features



Frequency-based features like this one contain little
information about low frequency classes. For words with skewed
sense distributions, it is likely that the most frequent
content words will be associated only with the dominant sense.

As an example, consider the 3 most frequent content words
occurring in the sentences that contain chief: officer,
executive and president. Chief has a majority class
distribution of 86% and, not surprisingly, these three content
words are all indicative of the dominant sense, which is
"highest in rank".

The set of content words used in formulating the co-occurrence
features is shown in Figure 4. Note that million and company
occur frequently. These are not likely to be indicative of a
particular sense but rather reflect the general nature of the
Wall Street Journal corpus.
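The construction of the binary co-occurrence features can be sketched as follows. The text does not say how content words are identified, so this Python example simply assumes a stop-word list; that list, the toy sentences, and the function names `top_content_words` and `cooccurrence_features` are all hypothetical.

```python
from collections import Counter


def top_content_words(sentences, target, stopwords, k=3):
    """The k most frequent content words in sentences containing the target."""
    counts = Counter(
        word
        for sentence in sentences
        if target in sentence
        for word in sentence
        if word != target and word not in stopwords
    )
    return [word for word, _ in counts.most_common(k)]


def cooccurrence_features(sentence, content_words):
    """Binary features: 1 if the content word occurs in the sentence, else 0."""
    present = set(sentence)
    return tuple(int(word in present) for word in content_words)


# Hypothetical tokenized, lower-cased sentences.
sentences = [
    ["the", "chief", "executive", "officer", "resigned"],
    ["chief", "executive", "officer", "of", "the", "company"],
    ["the", "president", "and", "chief", "executive"],
    ["president", "and", "chief", "executive", "officer"],
    ["the", "officer", "left"],  # ignored: does not contain the target word
]
stopwords = {"the", "of", "and"}  # assumed stand-in for a content-word filter
content_words = top_content_words(sentences, "chief", stopwords)
features = cooccurrence_features(["the", "chief", "officer", "spoke"], content_words)
```

Because the word list is chosen purely by frequency, it will tend to reflect the majority sense, which is the limitation noted above.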

Unrestricted Collocations Features of the form UL_i and UR_i
indicate the word occurring in the position i places to the
left or right, respectively, of the ambiguous word. All
features of this form have 21 possible values. Nineteen
correspond to the 19 most frequent words that occur in that
fixed position in all of the sentences that contain the
particular ambiguous word. There is also a value, (none), that
indicates when the position i to the left or right is occupied
by a word that is not among the 19 most frequent, and a value,
(null), indicating that the position i to the left or right
falls outside of the sentence boundary.

In these experiments we use 4 unrestricted collocation
features, UL_2, UL_1, UR_1, and UR_2. As an example, the
values of these features for concern are as follows:

• UL_2: and, the, a, of, to, financial, have, because, an,
's, real, cause, calif., york, u.s., other, mass., german,
(null), (none)

• UL_1: the, services, of, products, banking, 's,
pharmaceutical, energy, their, expressed, electronics, some,
biotechnology, aerospace, environmental, such, japanese, gas,
investment, (null), (none)

• UR_1: about, said, that, over, 's, in, with, had, are,
based, and, is, has, was, to, for, among, will, did,
(null), (none)

• UR_2: the, said, a, it, in, that, to, n't, is, which, by,
and, was, has, its, possible, net, but, annual,
(null), (none)
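The value of a collocation feature for a given occurrence can be computed as sketched below in Python. The sketch assumes tokenized sentences and a known index for the ambiguous word; the function names `word_at`, `frequent_words`, and `collocation_value` are hypothetical, and the toy data uses k = 1 instead of 19 only to keep the example small.

```python
from collections import Counter

NULL, NONE = "(null)", "(none)"


def word_at(sentence, target_index, offset):
    """Word at the given offset from the target, or None if outside the sentence."""
    position = target_index + offset
    return sentence[position] if 0 <= position < len(sentence) else None


def frequent_words(sentences, target, offset, k=19):
    """The k most frequent words found at a fixed offset from the target."""
    counts = Counter()
    for sentence in sentences:
        for index, word in enumerate(sentence):
            if word == target:
                neighbour = word_at(sentence, index, offset)
                if neighbour is not None:
                    counts[neighbour] += 1
    return {word for word, _ in counts.most_common(k)}


def collocation_value(sentence, target_index, offset, vocabulary):
    """One of the frequent words, (none) for any other word, (null) off the edge."""
    neighbour = word_at(sentence, target_index, offset)
    if neighbour is None:
        return NULL
    return neighbour if neighbour in vocabulary else NONE


# Hypothetical tokenized sentences containing the ambiguous word.
sentences = [
    ["the", "concern", "said"],
    ["concern", "about", "jobs"],
    ["the", "concern", "said"],
    ["a", "concern", "was"],
]
left_vocabulary = frequent_words(sentences, "concern", offset=-1, k=1)
values = [collocation_value(s, s.index("concern"), -1, left_vocabulary)
          for s in sentences]
```

With k = 19 each such feature takes one of 21 values, matching the description above.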

Content Collocations Features of the form CL_1 and CR_1
indicate the content word occurring in the position 1 place to
the left or right, respectively, of the ambiguous word. The
values of these features are defined much like the
unrestricted collocations above, except that these are
restricted to the 19 most frequent content words that occur
only one position to the left or right of the ambiguous word.

To contrast this set of features with the unrestricted
collocations, consider concern again. The values of the
features representing the 19 most frequent content words 1
position to the left and right are as follows:

• CL_1: services, products, banking, pharmaceutical, energy,
expressed, electronics, biotechnology, aerospace,
environmental, japanese, gas, investment, food, chemical,
broadcasting, u.s., industrial, growing, (null), (none)

• CR_1: said, had, are, based, has, was, did, owned, were,
regarding, have, declined, expressed, currently, controlled,
bought, announced, reported, posted, (null), (none)

Feature Sets A, B and C The 3 feature sets used in these
experiments are designated A, B and C and are formulated as
follows:

• A: M, PL_2, PL_1, PR_1, PR_2, C_1, C_2, C_3
Dimensionality: 5,000 - 35,000

• B: M, UL_2, UL_1, UR_1, UR_2
Dimensionality: 194,481 - 1,361,367

• C: M, PL_2, PL_1, PR_1, PR_2, CL_1, CR_1
Dimensionality: 275,625 - 1,929,375

The dimensionality is the number of possible combinations of
feature values and thus the size of the feature space. These
values vary since the number of possible values for M varies
with the part-of-speech of the ambiguous word. The lower
number is associated with adjectives and the higher with
verbs.

To get a feeling for the adequacy of these feature sets, we
performed supervised learning experiments with the interest
data using the Naive Bayes model. We disambiguated 3 senses
using a 10:1 training-to-test ratio. The average accuracies
for each feature set over 100 random trials were as follows:
A 80.9%, B 87.7%, and C 82.7%.

The window size, the number of values for the POS features,
and the number of words considered in the collocation features
are kept deliberately small in order to control the
dimensionality of the problem. In future work, we will expand
all of the above types of features and employ techniques to
reduce dimensionality along the lines suggested in (Duda and
Hart, 1973) and (Gale, Church, and Yarowsky, 1995).
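The dimensionality ranges quoted for Feature Sets A, B and C can be checked with a short calculation: each is the product of the number of values of every feature in the set. The Python sketch below assumes that M contributes a factor of 1 for adjectives (where it is not used) and 7 for verbs; the dictionary `FEATURE_VALUE_COUNTS` and the function `dimensionality` are illustrative names only.

```python
from math import prod

# Number of possible values of each non-morphological feature in a set.
FEATURE_VALUE_COUNTS = {
    "A": [5, 5, 5, 5, 2, 2, 2],   # four POS features, three binary co-occurrences
    "B": [21, 21, 21, 21],        # four unrestricted collocations
    "C": [5, 5, 5, 5, 21, 21],    # four POS features, two content collocations
}


def dimensionality(feature_set, morphology_values):
    """Size of the feature space for a given number of values of M."""
    return morphology_values * prod(FEATURE_VALUE_COUNTS[feature_set])


# M is unused for adjectives (factor 1) and has up to 7 values for verbs.
ranges = {name: (dimensionality(name, 1), dimensionality(name, 7))
          for name in FEATURE_VALUE_COUNTS}
```

For nouns, where M is binary, the same function gives values between the two extremes, for example 10,000 for Feature Set A.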



6 Experimental Results 

Figure 5 shows the average accuracy and standard
deviation of disambiguation over 25 random trials 
for each combination of word, feature set and learn- 
ing algorithm. Those cases where the average accu- 
racy of one algorithm for a particular feature set 
is significantly higher than another algorithm, as 
judged by the t-test (p=.01), are shown in bold face. 
For each word, the most accurate overall experiment 
(i.e., algorithm/feature set combination), and those 
that are not significantly less accurate are underlined.
Also included in Figure 5 is the percentage of
each sample that is composed of the majority sense. 
This is the accuracy that can be obtained by a ma- 
jority classifier, a simple classifier that assigns each 
ambiguous word to the most frequent sense in a sam- 
ple. However, bear in mind that in unsupervised ex- 
periments the distribution of senses is not generally 
known. 
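The majority-classifier baseline is simple enough to state in code. The following Python sketch uses a hypothetical list of sense tags whose proportions mirror those reported for concern; the function name `majority_accuracy` is our own.

```python
from collections import Counter


def majority_accuracy(senses):
    """Accuracy obtained by assigning every instance the most frequent sense."""
    most_common_count = Counter(senses).most_common(1)[0][1]
    return most_common_count / len(senses)


# Hypothetical sample of 100 tags with a 64% / 36% split.
sample_tags = ["a business; firm"] * 64 + ["worry; anxiety"] * 36
baseline = majority_accuracy(sample_tags)
```

Note that computing this baseline requires the sense tags, which an unsupervised learner does not see.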

Perhaps the most striking aspect of these results 
is that, across all experiments, only the nouns are 
disambiguated with accuracy greater than that of 
the majority classifier. This is at least partially ex- 
plained by the fact that, as a class, the nouns have 
the most uniform distribution of senses. This point 
will be elaborated on in Section 6.1. While the choice of
feature set impacts accuracy, overall it is only to a small
degree. We return to this point in Section 6.2. The final
result, to be discussed in Section 6.3, is that the
differences in the accuracy of these three algorithms are
statistically significant both on average and for individual
words.
                       Feature Set A                   Feature Set B                   Feature Set C
            Majority   McQuitty  Ward      EM          McQuitty  Ward      EM          McQuitty  Ward      EM
chief       .861       .844±.05  .721±.01  .729±.06    .831±.06  .611±.01  .646±.01    .856±.00  .673±.03  .697±.06
common      .842       .648±.12  .513±.08  .521±.00    .541±.11  .444±.04  .464±.06    .636±.07  .561±.05  .543±.09
last        .940       .791±.12  .598±.09  .903±.00    .797±.04  .659±.03  .909±.00    .799±.06  .601±.08  .874±.07
public      .683       .560±.08  .450±.05  .473±.03    .558±.07  .461±.03  .411±.03    .628±.05  .488±.04  .507±.03
adjectives  .832       .711±.15  .571±.12  .657±.18    .682±.15  .544±.10  .608±.20    .730±.11  .581±.08  .655±.16
bill        .681       [...]
concern     .638       [...]
drug        .567       [...]
interest    .593       [...]
line        [...]

Figure 5: Experimental Results - accuracy ± standard deviation



6.1 Distribution of Classes

Extremely skewed distributions pose a challenging learning
problem since the sample contains precious little information
regarding minority classes. This makes it difficult to learn
their distributions without prior knowledge. For unsupervised
approaches, this problem is exacerbated by the difficulty in
distinguishing the characteristics of the minority classes
from noise.

In this study, the accuracy of the unsupervised algorithms was
less than that of the majority classifier in every case where
the percentage of the majority sense exceeded 68%. However, in
the cases where the performance of these algorithms was less
than that of the majority classifier, they were often still
providing high accuracy disambiguation (e.g., 91% accuracy for
last). Clearly, the distribution of classes is not the only
factor affecting disambiguation accuracy; compare the
performance of these algorithms on bill and public which have
roughly the same class distributions.

It is difficult to quantify the effect of the distribution of
classes on a learning algorithm, particularly when using
naturally occurring data. In previous unsupervised experiments
with interest, using a modified version of Feature Set A, we
were able to achieve an increase of 36 percentage points over
the accuracy of the majority classifier when the 3 classes
were evenly distributed in the sample (Pedersen and Bruce,
1997b). Here, our best performance using a larger sample with
a natural distribution of senses is only an increase of 20
percentage points over the accuracy of the majority
classifier.

Because skewed distributions are common in lexical work (Zipf,
1935), they are an important consideration in formulating
disambiguation experiments.
In future work, we will investigate procedures for 
feature selection that are more sensitive to minor- 
ity classes. Reliance on frequency based features, as 
used in this work, means that the more skewed the 
sample is, the more likely it is that the features will 
be indicative of only the majority class. 

6.2 Feature Set 

Despite varying the feature sets, the relative accuracy of the
three algorithms remains rather consistent. For 6 of the 13
words there was a single algorithm that was always
significantly more accurate than the other two across all
feature sets.

The EM algorithm was most accurate for last and 
line with all three feature sets. McQuitty's method 
was significantly more accurate for chief, common, 
public, and help regardless of the feature set. 

Despite this consistency, there were some observable trends
associated with changes in feature set. For example,
McQuitty's method was significantly more accurate overall in
combination with Feature Set C while the EM algorithm was more
accurate with Feature Set A, and the accuracy of Ward's method
was the least favorable with Feature Set B.

For the nouns, there was no significant differ- 
ence between Feature Sets A and B when using 
the EM algorithm. For the verbs there was no 
significant difference between the three feature sets 
when using McQuitty's method. Disambiguation of the adjectives
was significantly more accurate when using McQuitty's
method and Feature Set C.

One possible explanation for the consistency of 
results as feature sets varied is that perhaps the fea- 
tures most indicative of word senses are included in 
all the sets due to the selection methods and the 
commonality of feature types. These common fea- 
tures may be sufficient for the level of disambigua- 
tion achieved here. This explanation seems more 
plausible for the EM algorithm, where features are 
weighted, but less so for McQuitty's and Ward's methods,
which use a representation that does not allow feature
weighting.

6.3 Disambiguation Algorithm 

Based on the average accuracy over part-of-speech 
categories, the EM algorithm performs with the 
highest accuracy for nouns while McQuitty's method 
performs most accurately for verbs and adjectives. 
This is true regardless of the feature set employed. 

The standard deviations give an indication of the effect of
ties on the clustering algorithms and the effect of the random
initialization on the EM algorithm. In few cases is the
standard deviation very small. For the clustering algorithms,
a high standard deviation indicates that ties are having some
effect on the cluster analysis. This is undesirable and may
point to a need to expand the feature set in order to reduce
ties. For the EM algorithm, a high standard deviation means
that the algorithm is not settling on any particular maximum.
Results may become more consistent if the number of parameters
that must be estimated were reduced.

Figures 6, 7, and 8 show the confusion matrices associated
with the disambiguation of concern, interest, and help, using
Feature Sets A, B, and C, respectively. A confusion matrix
shows the number of cases where the sense discovered by the
algorithm agrees with the manually assigned sense along the
main diagonal; disagreements are shown in the rest of the
matrix.
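A confusion matrix of this kind can be tabulated with a few lines of Python. The sketch below assumes rows index the manually assigned sense and columns the sense discovered by the algorithm, an orientation the text does not specify; the labels and the function name `confusion_matrix` are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold, discovered, senses):
    """Rows: manually assigned sense; columns: sense found by the algorithm."""
    counts = Counter(zip(gold, discovered))
    return [[counts[(g, d)] for d in senses] for g in senses]


# Hypothetical tags for six occurrences of an ambiguous word.
senses = ["firm", "worry"]
gold = ["firm", "firm", "worry", "worry", "firm", "firm"]
discovered = ["firm", "firm", "worry", "worry", "worry", "firm"]
matrix = confusion_matrix(gold, discovered, senses)

# Agreements lie on the main diagonal, so accuracy is the trace over the total.
agreement = sum(matrix[i][i] for i in range(len(senses))) / len(gold)
```

Comparing the row totals with the column totals shows at a glance whether an algorithm is pulling the discovered distribution toward a balanced one.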

In general, these matrices reveal that both the EM algorithm
and Ward's method are more biased toward balanced
distributions of senses than is McQuitty's method. This may
explain the better performance of McQuitty's method in
disambiguating those words with the most skewed sense
distributions, the adjectives and verbs. It is possible to
adjust the EM algorithm away from this tendency towards
discovering balanced distributions by providing prior
knowledge of the expected sense distribution. This will be
explored in future work.

Figure 6: concern - Feature Set A

7 Related Work 

Word-sense disambiguation has more commonly been cast as a
problem in supervised learning (e.g., (Black, 1988),
(Yarowsky, 1992), (Yarowsky, 1993), (Leacock, Towell, and
Voorhees, 1993), (Bruce and Wiebe, 1994), (Mooney, 1996),
(Ng and Lee, 1996), (Pedersen, Bruce, and Wiebe, 1997),
(Pedersen and Bruce, 1997a)). However, all of these methods
require that manually sense tagged text be available
to train the algorithm. For most domains such text 
is not available and is expensive to create. It seems 
more reasonable to assume that such text will not 
usually be available and attempt to pursue unsuper- 
vised approaches that rely only on the features in a 
text that can be automatically identified. 

7.1 Bootstrapping 

Bootstrapping approaches require a small amount 
of disambiguated text in order to initialize the un- 
supervised learning algorithm. An early example of 
such an approach is described in (Hearst, 1991). A
supervised learning algorithm is trained with a small
amount of manually sense tagged text and applied
to a held out test set. Those examples in the test set
that are most confidently disambiguated are added
to the training sample.

Figure 7: interest - Feature Set B

Figure 8: help - Feature Set C

A more recent bootstrapping approach is described
in (Yarowsky, 1995). This algorithm requires
a small number of training examples to serve as a 
seed. There are a variety of options discussed for 
automatically selecting seeds; one is to identify col- 
locations that uniquely distinguish between senses. 
For plant, the collocations manufacturing plant and 
living plant make such a distinction. Based on 106 
examples of manufacturing plant and 82 examples of 
living plant this algorithm is able to distinguish be- 
tween two senses of plant for 7,350 examples with 97 
percent accuracy. Experiments with 11 other words 
using collocation seeds result in an average accuracy 
of 96 percent. 
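
The seeding step can be illustrated with a short Python sketch. It covers only the initial labeling by collocation, not the full bootstrapping procedure of (Yarowsky, 1995); the function name, the sense labels "factory" and "organism", and the example sentences are assumptions made for illustration.

```python
def seed_labels(contexts, seeds):
    """Label a context only when it matches the seed collocations of exactly one sense."""
    labels = []
    for text in contexts:
        lowered = text.lower()
        matched = {
            sense
            for sense, phrases in seeds.items()
            if any(phrase in lowered for phrase in phrases)
        }
        # Contexts with no match, or with conflicting matches, stay unlabeled.
        labels.append(matched.pop() if len(matched) == 1 else None)
    return labels

# Hypothetical sense names; the collocations are the ones mentioned in the text.
seeds = {"factory": ["manufacturing plant"], "organism": ["living plant"]}
contexts = [
    "The manufacturing plant closed last year.",
    "Every living plant needs water.",
    "The plant was large.",
]
labels = seed_labels(contexts, seeds)
```

Contexts left as None form the untagged pool that a bootstrapping algorithm would go on to label in later iterations.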



While (Yarowsky, 1995) does not discuss
distinguishing more than 2 senses of a word, there is no
immediate reason to doubt that the "one sense per
collocation" rule (Yarowsky, 1993) would still hold
for a larger number of senses. In future work we
will evaluate using the "one sense per collocation"
rule to seed our various methods. This may help
in dealing with very skewed distributions of senses
since we currently select collocations based simply
on frequency.

7.2 Clustering 

Clustering has most often been applied in natural
language processing as a method for inducing
syntactic or semantically related groupings of words
(e.g., (Rosenfeld, Huang, and Schneider, 1969),
(Kiss, 1973), (Ritter and Kohonen, 1989), (Pereira,
Tishby, and Lee, 1993), (Schütze, 1993), (Resnik,
1995a)).

An early application of clustering to word-sense
disambiguation is described in (Schütze, 1992).
There, words are represented in terms of the
co-occurrence statistics of four letter sequences. This
representation uses 97 features to characterize a
word, where each feature is a linear combination of
letter four-grams formulated by a singular value
decomposition of a 5000 by 5000 matrix of letter
four-gram co-occurrence frequencies. The weight
associated with each feature reflects all usages of the word
in the sample. A context vector is formed for each
occurrence of an ambiguous word by summing the
vectors of the contextual words (the number of
contextual words considered in the sum is unspecified).
The set of context vectors for the word to be
disambiguated is then clustered, and the clusters are
manually sense tagged.
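
The summation that forms a context vector can be sketched in Python as follows. This is a toy illustration with 3-dimensional vectors rather than the 97 features described above; the function name, the example words, and the choice to skip words without a vector are assumptions, not details taken from (Schütze, 1992).

```python
def context_vector(context_words, word_vectors, dim):
    """Sum the vectors of the contextual words, component by component."""
    total = [0.0] * dim
    for word in context_words:
        vec = word_vectors.get(word)
        if vec is None:
            # Assumption: words without a vector contribute nothing to the sum.
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total

# Toy word vectors for illustration only.
word_vectors = {"bank": [1.0, 0.0, 2.0], "river": [0.0, 3.0, 1.0]}
vec = context_vector(["river", "bank", "unseen"], word_vectors, 3)
```

One such vector is produced per occurrence of the ambiguous word, and the resulting set of vectors is what gets clustered.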

The features used in this work are complex and
difficult to interpret and it isn't clear that this
complexity is required. (Yarowsky, 1995) compares his
method to (Schütze, 1992) and shows that for four
words the former performs significantly better in
distinguishing between two senses.

Other clustering approaches to word-sense disam- 
biguation have been based on measures of semantic 
distance defined with respect to a semantic network 
such as WordNet. Measures of semantic distance 
are based on the path length between concepts in a 
network and are used to group semantically similar 
concepts (e.g., (Li, Szpakowicz, and Matwin, 1995)).
(Resnik, 1995b) provides an information theoretic
definition of semantic distance based on WordNet.
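
A path-length measure of the kind mentioned above can be sketched as a breadth-first search over a concept graph. The tiny hierarchy below is a made-up stand-in, not WordNet, and the function name is an assumption; the sketch covers only the plain path-length idea, not the information theoretic definition of (Resnik, 1995b).

```python
from collections import deque

def path_length(graph, start, goal):
    """Number of edges on the shortest path between two concepts, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two concepts are not connected

# Toy hierarchy for illustration; edges are treated as undirected.
edges = [("entity", "organism"), ("organism", "plant"),
         ("organism", "animal"), ("animal", "dog")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)
```

Under this measure, a shorter path means two concepts are treated as semantically closer, so they are more likely to be grouped together.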



(McDonald et al., 1990) apply another clustering
approach to word-sense disambiguation (also
see (Wilks et al., 1990)). They use co-occurrence
data gathered from the machine-readable version of
LDOCE to define neighborhoods of related words.
Conceptually, the neighborhood of a word is a type
of equivalence class. It is composed of all other words
that co-occur with the designated word a significant
number of times in the LDOCE sense definitions.
These neighborhoods are used to increase the number
of words in the LDOCE sense definitions, while
still maintaining some measure of lexical cohesion.
The "expanded" sense definitions are then compared
to the context of an ambiguous word, and the sense
definition with the greatest number of word overlaps
with the context is selected as correct. (Guthrie
et al., 1991) propose that neighborhoods be subject
dependent. They suggest that a word should
potentially have different neighborhoods corresponding
to the different LDOCE subject codes. Subject-specific
neighborhoods are composed of words having
at least one sense marked with that subject code.
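
The overlap comparison between expanded sense definitions and a context can be sketched in Python as below. The word lists are invented for illustration and are not LDOCE definitions or neighborhoods; the function name and the tie-breaking behavior (the first sense listed wins) are likewise assumptions.

```python
def best_sense(context_words, sense_definitions):
    """Pick the sense whose (expanded) definition shares the most words with the context."""
    context = set(context_words)
    overlaps = {
        sense: len(context & set(words))
        for sense, words in sense_definitions.items()
    }
    # max() returns the first sense with the highest count, so ties go to
    # whichever sense was listed first.
    return max(overlaps, key=overlaps.get), overlaps

# Invented definitions for two hypothetical senses of "bank".
definitions = {
    "finance": ["money", "deposit", "loan", "account"],
    "river": ["water", "shore", "stream", "slope"],
}
sense, overlaps = best_sense(
    ["she", "opened", "an", "account", "to", "deposit", "money"], definitions
)
```

Expanding each definition with its neighborhood enlarges the word lists, which raises the chance of a non-zero overlap with a short context.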

7.3 EM algorithm 

The only other application of the EM algorithm
to word-sense disambiguation is described in (Gale,
Church, and Yarowsky, 1995). There the EM
algorithm is used as part of a supervised learning
algorithm to distinguish city names from people's names.
A narrow window of context, one or two words to 
either side, was found to perform better than wider 
windows. The results presented are preliminary but 
show an accuracy percentage in the mid-nineties 
when applied to Dixon, a name found to be quite 
ambiguous. 

It should be noted that the EM algorithm relates
to a large body of work in speech processing. The
Baum-Welch forward-backward algorithm (Baum,
1972) is a specialized form of the EM algorithm
that assumes the underlying parametric model is a
hidden Markov model. The Baum-Welch
forward-backward algorithm has been used extensively in
speech recognition (e.g., (Levinson, Rabiner, and
Sondhi, 1983), (Kupiec, 1992), (Jelinek, 1990)).



8 Conclusions 

Supervised learning approaches to word-sense dis- 
ambiguation fall victim to the knowledge acquisi- 
tion bottleneck. The creation of sense tagged text 
sufficient to serve as a training sample is expensive 
and time consuming. This bottleneck is eliminated 
through the use of unsupervised learning approaches 
which distinguish the sense of a word based only on 
features that can be automatically identified. 

In this study, we evaluated the performance of 
three unsupervised learning algorithms on the dis- 
ambiguation of 13 words in naturally occurring text. 
The algorithms are McQuitty's similarity analysis, 
Ward's minimum-variance method, and the EM al- 
gorithm. Our findings show that each of these al- 
gorithms is negatively impacted by highly skewed 
sense distributions. Our methods and feature sets
were found to be more successful in disambiguating
nouns than adjectives or verbs. Overall, the
most successful of our procedures was McQuitty's
similarity analysis in combination with a high
dimensional feature set. In future work, we will
investigate modifications of these algorithms and
feature set selection that are more effective on highly
skewed sense distributions.